A Thermodynamical Approach for
Probability Estimation

Takashi Isozaki 
Sony Computer Science Laboratories, Inc. 
3-14-13 Higashigotanda Shinagawa-ku, Tokyo 141-0022 Japan. 
isozaki@csl.sony.co.jp

December 13, 2012 

Abstract 

The issue of discrete probability estimation for samples of small size
is addressed in this study. The maximum likelihood method often suffers
from overfitting when insufficient data is available. Although the Bayesian
approach can avoid overfitting by using prior distributions, it still has
problems with objective analysis. In response to these drawbacks, a new
theoretical framework based on thermodynamics, in which energy and
temperature are introduced, is developed. Entropy and likelihood are placed
at the center of this method. The key principle of inference for probability
mass functions is the minimum free energy principle, which is shown to unify
the two principles of maximum likelihood and maximum entropy. Our
method can robustly estimate probability functions from small samples.

1 Introduction 

A method for estimating the probabilities of discrete random variables is
developed. It is based on the key idea that statistical inference can be
described by a combination of two frameworks, namely, thermodynamics and
information theory. Special attention is paid to the roles of temperature and
entropy in the method. In other words, heat is introduced into statistical
inference. Furthermore, the method has no free parameters, not even the
temperature. The proposed method makes it possible to unify the maximum
likelihood principle and the maximum entropy principle for statistical
inference, even from small samples.

In recent times, the amount of available data has been growing day by day.
As a result, large amounts of data can be obtained not only for one variable
but for many variables. Intuitively, using conditional probabilities and
joint probabilities can reduce the entropy of the variables of interest, which
is guaranteed by information-theoretic inequalities. Note that in this paper a
capital letter such as X denotes a discrete random variable, a lowercase letter
such as x denotes a particular state of that variable, a bold capital letter
denotes a set of variables, and a bold lowercase letter denotes a configuration
of that set. Here, we adopt Gibbs-Shannon entropy (Shannon, 1948) as entropy
(we call this entropy Shannon entropy hereafter), defined as

\[ H(X) = -\sum_{x} P(x) \log P(x), \]

where P is a probability mass function. The inequalities are thus given as

\[ H(X) \geq H(X \mid Y) \]

and

\[ H(X) + H(Y) \geq H(X, Y), \]

where equality holds if and only if X and Y are independent (Shannon, 1948).
It is therefore preferable to use data generated from many variables, because
highly predictive statistical inference requires low-entropy parameters. In
discrete, many-variable systems, statistical estimation of conditional
probabilities and joint probabilities needs exponentially large data because
of the combinatorial explosion of events. Although the maximum likelihood (ML)
principle and methods play significant roles in statistics and are regarded as
the most general principle in statistics, it is known that ML methods often
suffer from overfitting and that they are ineffective when the data size is
insufficient relative to the number of parameters.
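The two entropy inequalities quoted above can be checked numerically. The following is a minimal sketch, not part of the original method; the joint table `joint` is a made-up example, and the conditional entropy is obtained from the chain rule H(X|Y) = H(X,Y) - H(Y).

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical joint distribution P(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals P(x) and P(y), accumulated from the joint table.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

h_x = entropy(px.values())
h_y = entropy(py.values())
h_xy = entropy(joint.values())
h_x_given_y = h_xy - h_y  # chain rule

print(h_x >= h_x_given_y, h_x + h_y >= h_xy)
```

Because X and Y are dependent in this table, both inequalities hold strictly.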

The fact that ML methods often suffer from overfitting in multivariate
statistical analysis with many parameters makes Bayesian statistics
increasingly attractive from the viewpoint of avoiding overfitting. Bayesian
statistics incorporates background knowledge, which compensates for a shortage
of data, and it has become increasingly popular in natural science, including
physics (Dose, 2003). It can be said that Bayesian statistics adds prior
imaginary frequencies of events to the real ones. Furthermore, even when no
prior knowledge is available, or in public analysis that must exclude prior
knowledge to avoid introducing unnecessary bias, Bayesian statistics can reduce
overfitting by means of noninformative priors (e.g., Kass and Wasserman, 1996).
For example, in discrete random-variable systems, Bayesian statistics usually
uses Dirichlet distributions, which always have parameters. The parameters
of Dirichlet prior distributions are often interpreted as prior samples. When
those parameters are uniform and express no special prior knowledge, they can
increase the entropy of ML estimators and thereby make estimated probabilities
more robust than those obtained from ML estimation. This feature of those
parameters can be regarded as a generalization of the principle of insufficient
reason proposed by Laplace.
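The entropy-raising effect of uniform Dirichlet parameters can be illustrated with the standard posterior-mean formula (n_k + alpha) / (N + K alpha). This is a minimal sketch; the counts and the value `alpha = 1.0` are illustrative assumptions, not values taken from this paper.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

counts = [7, 2, 1, 0]  # hypothetical counts for a 4-state variable
alpha = 1.0            # assumed uniform Dirichlet parameter (pseudo-count per state)

n = sum(counts)
k = len(counts)
ml = [c / n for c in counts]                                  # relative frequencies
posterior_mean = [(c + alpha) / (n + k * alpha) for c in counts]

print(entropy(ml), entropy(posterior_mean))
```

The posterior mean is a mixture of the ML estimate and the uniform distribution, so its entropy is never smaller than that of the ML estimate.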

Although noninformative priors have been widely used, they still have some
problems (Kass and Wasserman, 1996; Irony and Singpurwalla, 1997; Robert,
2007). For example, Jeffreys' priors (Jeffreys, 1961), which are the most widely
accepted noninformative priors, may fail to be normalizable, in which case they
do not satisfy the axioms of probability and are said to be improper
distributions. In addition, even the posterior can still be improper (Kass and
Wasserman, 1996). As for statistical inference, it is thus probably reasonable
to explore another principle or theoretical framework.

A theoretical framework for estimating the probabilities of discrete variables
for use in objective analysis is proposed in the following. The new framework
keeps a good characteristic of Bayesian statistics: it increases the entropy
obtained from ML estimators when no prior knowledge is available. Entropy is
regarded as representing the uncertainty of information. Accordingly, the
entropy in the case of insufficient data should be larger than that obtained
from the hidden true distribution, because the smaller the data size is, the
higher the uncertainty becomes. That is, we consider that entropy consists of
uncertainty due to limited sample size and uncertainty due to the true
probability distribution. It is therefore proposed that probability estimation
should be regarded as searching for the optimal value of entropy according to
the available data size and the properties of the data. However, regarding
limited sample size (which increases uncertainty), neither a satisfactory
principle nor a method for optimally estimating entropy exists. Jaynes (1957)
proposed a method based on the maximum entropy principle, which has been used
in some domains including physics (e.g., Berger et al., 1996; Huscroft et al.,
2000; Caticha and Preuss, 2004). Our method utilizes frequency information, as
ML methods do, in addition to the ME principle, and the two cooperate to
correct the excess bias due to small samples, whereas Jaynes' method does not.
We regard this as the essential difference between the two methods.

The theoretical framework on which the proposed probability estimation
method is based consists of and unifies two well-known principles. The first is
the maximum likelihood (ML) principle, which states that the best estimator
is the one that reproduces the data most closely and which is very effective
when the data size is sufficiently large. The second is the maximum entropy
(ME) principle, which states that, within some constraints, as little bias as
possible should be applied to particular internal states of variables. However,
the two principles are contrary to each other, because the expectation value of
the minus log-likelihood is the same as the empirical entropy, so that
maximizing the likelihood drives the estimator toward low entropy. In clear
contrast to the ME principle, it is intuitively reasonable that the estimator
with the lowest entropy and the highest likelihood is preferable when
sufficiently large data is available.

Given the above-described conflict between the ML and ME principles, it is
necessary to devise a method for analyzing samples of insufficient size.
It seems natural that there is a balancing value between the entropies given
by the ML and ME principles. If it is assumed that such a balancing value
exists, it is necessary to find the optimal point between the two contrary
principles in statistical inference.

In a branch of natural science, namely, thermodynamics, there is an analogy
with the above-described trade-off between the ML and ME principles (Kittel and
Kroemer, 1980; Callen, 1985). In thermodynamics, nature selects the state that
achieves a balance between the minimum energy state and the maximum entropy
state at a finite temperature. It is assumed here that this analogy applies to
statistical inference in discrete variables. Consequently, temperature, which
plays the role of a unit of measure, is introduced. This approach extends our
preceding works (Isozaki et al., 2008, 2009), in which temperature was
represented by an artificial model containing a free hyperparameter. In the
present work, temperature is entirely redefined in a new method with no free
hyperparameters. According to our proposed method, a new interpretation of
probability estimated from data is presented, which is neither frequentism
(Hájek, 1997) nor Bayesianism.

In the machine learning domain, some methods similar to ours have been
proposed (Pereira et al., 1993; Ueda and Nakano, 1995; Hofmann, 1999; LeCun and
Huang, 2005; Watanabe et al., 2009). Nevertheless, many studies that have
applied free energy to statistical science either have not included temperature
or have treated it as a controlled parameter, a fixed parameter, or a free
parameter, apparently because its meaning in data science is unclear. We
therefore consider that the potential of free energy has not been fully
exploited in existing research. Similar methods in the context of robust
estimation, in which a free parameter is introduced in a similar fashion, have
also been investigated (Windham, 1995; Basu et al., 1998; Jones et al., 2001),
but how to determine the free parameter for small samples remains an open
problem.

This paper is organized as follows. In the next section, the basic theory
based on thermodynamics is explained. The proposed probability estimation
method is introduced in Section 3, where estimation methods for joint
probabilities and conditional probabilities are also proposed. Section 4
presents the results of experiments using the probability estimation method.
The relationships between our method and classical/Bayesian statistics are
discussed in Section 5. Section 6 concludes this study.



2 Basic theory 

In constructing a method for estimating finite discrete probability
distributions from samples of finite size, we utilize both Shannon entropy and
likelihood. However, a new principle is needed for combining the two concepts;
accordingly, we assume that this principle is provided by the thermodynamical
framework. In the following, therefore, entropy, energy, temperature, and
Helmholtz free energy are defined for this purpose. The postulates needed for
constructing the method and its properties are then described. Hereafter,
multivariate random systems are treated without any prior knowledge. All random
variables are assumed to be discrete, and samples are assumed to be i.i.d.
data. An extension to the case with available prior knowledge is discussed in
a later section.

2.1 Definitions and postulates
2.1.1 Entropy 

Definition 2.1 (Entropy) The entropy H(X) of a discrete random variable
X is defined, according to Shannon (1948), as

\[ H(X) := -\sum_{x} P(x) \log P(x). \tag{1} \]

If P(X) is an estimated probability function, H(X) is also an estimated function
of P(X). Entropy is also denoted as H(P(X)) or H(P) in order to make it clear
which distributions are used.

Joint entropy $H(\mathbf{X})$ of multivariate systems and conditional entropy
$H(X \mid Y)$ are defined as follows (Shannon, 1948; Cover and Thomas, 2006):

\[ H(\mathbf{X}) := -\sum_{\mathbf{x}} P(\mathbf{x}) \log P(\mathbf{x}) \tag{2} \]

and

\[ H(X \mid Y) := -\sum_{x, y} P(x, y) \log P(x \mid y). \tag{3} \]

It should be noted that probability mass function P(x) and entropy H are
quantities that are to be estimated from data in this study. Accordingly, it
should be emphasized that the entropy has two aspects of uncertainty. The
first is the uncertainty intrinsic to each true probability distribution; the
second is the uncertainty that comes from the finiteness of the available data.
To the author's knowledge, the second aspect of entropy has not been
specifically discussed. We therefore introduce a mechanism to estimate the
optimal uncertainty under the given finite data.

2.1.2 Energy 

The (internal) energy of a probability system is defined as follows. First, a
distance-like quantity between two distributions is defined in the usual way.

Definition 2.2 (Kullback-Leibler divergence) The Kullback-Leibler (KL)
divergence (Kullback and Leibler, 1951) between two distributions of a random
variable X, i.e., P(X) and Q(X), is given as follows:

\[ D(P(X) \,\|\, Q(X)) := \sum_{x} P(x) \log \frac{P(x)}{Q(x)}. \tag{4} \]

For multivariate systems, $D(P(\mathbf{X}) \,\|\, Q(\mathbf{X}))$ can be defined
in the same manner. Conditional KL divergence is defined as

\[ D(P(X \mid Y) \,\|\, Q(X \mid Y)) := \sum_{x, y} P(x, y) \log \frac{P(x \mid y)}{Q(x \mid y)}. \tag{5} \]

Next, the cross entropy, which is also useful for representing the energy, is
defined.

Definition 2.3 (Cross entropy) The cross entropy of discrete random variable
X between probability distributions P(X) and Q(X), i.e., H(P(X), Q(X)),
is defined as

\[ H(P(X), Q(X)) := -\sum_{x} P(x) \log Q(x). \tag{6} \]

Cross entropy is also denoted as H(P, Q). The following relationship between
KL divergence and cross entropy is easily derived:

\[ D(P(X) \,\|\, Q(X)) + H(P(X)) = H(P(X), Q(X)). \tag{7} \]

According to Jensen's inequality, $D(P \,\|\, Q) \geq 0$ (Cover and Thomas, 2006)
and $H(P, Q) \geq H(P)$.
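The relation D(P || Q) + H(P) = H(P, Q) and the two inequalities can be verified on a small example. The sketch below uses made-up distributions `p` and `q` and assumes Q(x) > 0 wherever P(x) > 0.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x); assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x)); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]  # hypothetical distribution P
q = [0.2, 0.5, 0.3]  # hypothetical distribution Q

lhs = kl_divergence(p, q) + entropy(p)
rhs = cross_entropy(p, q)
print(abs(lhs - rhs) < 1e-12, kl_divergence(p, q) >= 0, rhs >= entropy(p))
```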

Empirical distribution functions are defined in the usual way.

Definition 2.4 (Empirical distributions) It is assumed that there are N
samples of random variable X: $\{y^{(1)}, \ldots, y^{(N)}\}$. An empirical
distribution of X, $\tilde{P}(X)$, is defined as

\[ \tilde{P}(X = x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - y^{(i)}), \tag{8} \]

where $\delta(x - y) = 1$ if $x = y$ and $\delta(x - y) = 0$ if $x \neq y$.

$\tilde{P}(X)$ is the relative frequency, that is, a maximum likelihood (ML)
estimator, which is denoted by $\hat{P}(X)$.
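As an illustration of the definition above, the empirical distribution is simply the table of relative frequencies. In this minimal sketch the function name `empirical_distribution` and the sample values are hypothetical; states that never occur receive probability zero.

```python
from collections import Counter

def empirical_distribution(samples, states):
    """Relative frequency of each state in `states` among `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return {s: counts[s] / n for s in states}

samples = ["a", "b", "a", "c", "a"]  # hypothetical i.i.d. samples
p_tilde = empirical_distribution(samples, states=["a", "b", "c", "d"])
print(p_tilde)
```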

Definition 2.5 (Information energy) (Internal) energy is defined by using the
Kullback-Leibler divergence as a distortion between the distribution of a
target and an empirical distribution:

\[ U_0(X) := D(P_1(X) \,\|\, P_2(X)), \tag{9} \]

where $P_1(X)$ denotes a target mass function to be estimated, and $P_2(X)$
denotes an empirical function or the ML estimator. Cross entropy can also be
used as an alternative to the KL divergence:

\[ U(X) := H(P_1(X), P_2(X)). \tag{10} \]

$U_0$ and U are, hereafter, both called "information energy".

It is noteworthy that minimizing $U_0$ for probability estimation corresponds
to the ML principle.

Self-information energy of X or $\mathbf{X}$, denoted by $\epsilon$, is defined
as

\[ \epsilon(X) := -\log \tilde{P}(X), \tag{11} \]

where $\tilde{P}(X)$ is the empirical distribution, and $\epsilon(X)$ can be
defined for any state x of X for which $\tilde{P}(x) > 0$. Practically, the ML
estimator, $\hat{P}(X)$, can be used as an alternative to $\tilde{P}(X)$. That
is, the following equation is used:

\[ \epsilon(X) := -\log \hat{P}(X). \tag{12} \]

This equation indicates that self-information energy is the negative maximum
log-likelihood.

2.1.3 Temperature 

Inverse temperature, $\beta_0$, is introduced as one of the most significant
quantities for statistical inference with finite data size. $\beta_0$ is used
instead of physical temperature, often denoted as T, and is simply called
"temperature" hereafter. Temperature is regarded as a bridge between
thermodynamics and statistical inference. As described in our preliminary work
(Isozaki et al., 2008, 2009), by introducing a thermodynamical framework for
statistical inference, fluctuation due to the finiteness of the available data
can be regarded as thermal fluctuation. In the following, this idea is applied
to define temperature for constructing a probability estimation method.

The fluctuation in the data can be expressed by the distortion between the
ML estimator for the currently available data size n and the probability
function estimated from the new framework by using n − 1 data (which do not
include the nth datum). The fluctuation is related to the temperature as a unit
of measure. $P_i(X)$ is first defined as the new estimator obtained from i
data, and the averaged estimator $P^G_n(X)$, which denotes the geometric mean
for data size n ($\geq 0$), is defined as follows:

\[ P^G_n(X) := \Bigl[ \prod_{i=0}^{n} P_i(X) \Bigr]^{1/(n+1)}, \tag{13} \]

where $P^G_0(X) := P_0(X) := 1/|X|$ is defined as a uniform function, in which
$|X|$ denotes the number of elements in the range of X. This definition for
n = 0 corresponds to the ME principle. It follows that the distortion is
expressed as a KL divergence, and the divergence is connected to the
temperature.

Definition 2.6 (Data temperature) For a natural-number data size, i.e.,
$n \geq 1$, the inverse temperature of a random variable X, namely,
$\beta_0(X)$, is defined as

\[ \beta_0(X) := \frac{1}{D(P^G_{n-1}(X) \,\|\, \hat{P}(X))} \tag{14} \]

for $D(P^G_{n-1}(X) \,\|\, \hat{P}(X)) \neq 0$, where $P^G_{n-1}(X)$ is defined
by Equation (13). $\beta_0(X) := 0$ for data size n = 0, and
$\beta_0(\mathbf{X})$ of multivariate systems, i.e., $\mathbf{X}$, can be
defined in the same way. Note that variables X and $\mathbf{X}$ are often
omitted if they can be clearly recognized.

We call $\beta_0$ "data temperature". According to this definition, it can be
assumed that $0 < \beta_0 < \infty$ for n > 0. It will be seen that this
positivity of temperature is consistent with a postulate described in Section
2.1.5 and with Equation (33). In general, $\beta_0$ becomes large, that is, the
system approaches a low-temperature state, as the available data size grows,
because the fluctuation of the data becomes small.

Normalized data temperature, which improves the tractability of data
temperature in mathematical formulas, is defined as follows:

Definition 2.7 (Data temperature II) For a natural-number data size, i.e.,
$n \geq 1$, the temperature of a random variable, $\beta(X)$, is defined as

\[ \beta(X) := \frac{\beta_0(X)}{\beta_0(X) + 1}. \tag{15} \]

$\beta(\mathbf{X})$ is defined in the same way, and the variable name(s) are
often omitted. According to this definition, $0 < \beta < 1$ for n > 0.
$\beta := 0$ for data size n = 0. Hereafter, $\beta$ and $\beta_0$ are used
interchangeably, and both are called "data temperature" when there is no danger
of confusion.
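A single evaluation of the two data temperatures can be sketched as follows, taking beta_0 as the reciprocal of the KL divergence from the previous geometric-mean estimator to the current ML estimator, and beta = beta_0 / (beta_0 + 1). The function names, the example inputs, and the handling of the limiting cases (zero or infinite divergence) are assumptions made for illustration; the sketch takes its inputs as given and does not check that the divergence is positive.

```python
import math

def kl_divergence(p, q):
    """D(p || q); returns inf if q is zero somewhere p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def data_temperature(p_geo_prev, p_ml):
    """Return (beta0, beta) for one data size.

    p_geo_prev: geometric-mean estimator built from the first n - 1 data.
    p_ml:       ML estimator (relative frequencies) from n data.
    """
    d = kl_divergence(p_geo_prev, p_ml)
    if d == 0:
        # Assumed limiting value; the definition itself is stated only for D != 0.
        return math.inf, 1.0
    beta0 = 1.0 / d  # an infinite divergence gives beta0 = 0.0
    return beta0, beta0 / (beta0 + 1.0)

# Hypothetical inputs: uniform previous estimator, three-state ML estimate.
beta0, beta = data_temperature([1 / 3, 1 / 3, 1 / 3], [0.5, 0.3, 0.2])
print(beta0, beta)
```

When the ML estimate assigns zero probability to a state that the previous estimator supports, the divergence is infinite and both temperatures become zero, which corresponds to the high-temperature limit.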

2.1.4 Helmholtz free energy 

Helmholtz free energy, which plays a significant role in the method we present,
is introduced next.

Definition 2.8 (Helmholtz free energy) Helmholtz free energy, F for X, is
defined by using information energy $U_0(X)$, Shannon entropy H(X), and data
temperature $\beta_0(X)$ as follows:

\[ F(X) := U_0(X) - \frac{H(X)}{\beta_0(X)}. \tag{16} \]

Free energy can be equivalently rewritten using $\beta$ and cross entropy U,
instead of $\beta_0$ and $U_0$, as

\[ F(X) := U(X) - \frac{H(X)}{\beta(X)}. \tag{17} \]

For multivariate systems, $F(\mathbf{X})$ can be defined in the same way.

The second terms on the right-hand sides of Equations (16) and (17) represent
thermal energy in thermodynamics. It follows that the concept of heat is
explicitly introduced into statistical inference.
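The equivalence of the two forms of the free energy, one written with the KL divergence and beta_0 and the other with the cross entropy and beta = beta_0 / (beta_0 + 1), can be checked numerically. The sketch below is illustrative only; the candidate distribution `p`, the ML estimate `p_ml`, and the value of `beta0` are made up.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def free_energy_kl(p, p_ml, beta0):
    """F = U0 - H / beta0, with U0 the KL divergence D(p || p_ml)."""
    u0 = cross_entropy(p, p_ml) - entropy(p)  # D = H(P, Q) - H(P)
    return u0 - entropy(p) / beta0

def free_energy_cross(p, p_ml, beta):
    """F = U - H / beta, with U the cross entropy H(p, p_ml)."""
    return cross_entropy(p, p_ml) - entropy(p) / beta

p = [0.4, 0.35, 0.25]  # hypothetical candidate distribution
p_ml = [0.5, 0.3, 0.2]  # hypothetical ML estimate
beta0 = 4.0             # hypothetical data temperature
beta = beta0 / (beta0 + 1.0)

f_kl = free_energy_kl(p, p_ml, beta0)
f_cross = free_energy_cross(p, p_ml, beta)
print(f_kl, f_cross)
```

The two values agree because 1/beta = 1 + 1/beta_0, so the extra entropy term contained in the cross entropy is absorbed into the temperature.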

2.1.5 Postulates 

To assure the positivity of temperature, the following postulate, which will be
needed in Section 3.3.1, is first assumed.

Postulate 2.1 Entropy is differentiable and is a monotonically increasing
function of energy. The partial derivative with respect to the energy takes
positive values as follows:

\[ \frac{\partial H}{\partial U_0} > 0, \tag{18} \]

where $U_0$ denotes the energy represented by the KL divergence, and

\[ \frac{\partial H}{\partial U} > 0, \tag{19} \]

where U denotes the energy represented by the cross entropy.

The key principle of our method is assumed as follows. At the large sample
limit, $U_0 \to 0$ is reasonable in accordance with the ML principle, while it
is reasonable that H takes its maximum value at the small sample limit in
accordance with the maximum entropy (ME) principle. For a finite sample size,
it is thus reasonable that estimators of probabilities take the values that
balance both principles in accordance with the data size and the true hidden
intrinsic entropies. We postulate that the minimum (Helmholtz) free energy
principle determines the optimal balance.

Postulate 2.2 (Minimum free energy principle) Probability mass functions
that are estimated from data are such as to minimize the Helmholtz free energy.

We call this principle the MFE principle.

The two above-stated postulates are all that is needed for the framework
of our proposed method. It is noteworthy that these postulates are parts of
thermodynamics (Callen, 1985), implying that our method of statistical
inference is fully based on the framework of thermodynamics (except for the
interpretation of the entropy, for which Shannon entropy is adopted).

The new method developed here selects a probability mass function that
maximizes entropy as far as possible according to the minimum free energy
principle, while the maximum entropy method of Jaynes (1957) maximizes the
entropy subject to a different constraint that he adopted. The new method also
corrects the bias generated by the limited sample size, which is a major
difference from Jaynes' method. We consider that the difference arises from
the theoretical grounds: our method is grounded in thermodynamics with Shannon
entropy, whereas Jaynes' method is grounded only in information theory.

The basis of our method for inference is described by using Shannon entropy
and introducing "information energy", "data temperature", and the minimum
free energy (MFE) principle. When we estimate probability functions from finite
data, the purpose is to obtain effective information from the data for
recognizing truth and/or predicting future events. From this viewpoint, with a
finite data size, it is reasonable to select a probability function that
explains the data to some extent but has some additional uncertainty due to the
limited number of samples. The MFE principle thus unifies the ML and ME
principles, and thermodynamics has a similar relationship: the MFE principle
unifies the minimum (internal) energy principle and the maximum entropy
principle (Callen, 1985).

3 Probability estimation



A probability-estimation method based on the theory described in the previous 
section is formalized in the following. The estimation is based on the MFE 
principle. Multinomial distributions are used for discrete random variables in 
the usual manner. 



3.1 Probability estimation method 

The entropy of a variable X is first defined by Equation (1). Information
energy is then defined, following Equation (9), by using probability P(X) and
the empirical distribution $\tilde{P}(X)$ as

\[ U_0(X) := D(P(X) \,\|\, \tilde{P}(X)) = \sum_{x} P(x) \log \frac{P(x)}{\tilde{P}(x)}. \tag{20} \]

$\tilde{P}(X)$ is replaced by the ML estimator $\hat{P}(X)$ because
$\tilde{P}(X)$ denotes relative frequency, which is the same as $\hat{P}(X)$
for binomial or multinomial distributions under the i.i.d. condition.

According to the MFE principle, the probability estimator P(X) with $\beta_0$
or $\beta$ is obtained by minimizing F under the constraint
$\sum_x P(x) = 1$. The problem is therefore solved by using Lagrange
multipliers. Free energy F is written in the following form with Shannon
entropy and information energy:

\[ F = U_0 - \frac{1}{\beta_0} H. \tag{21} \]

$\beta$ can be used as an alternative to $\beta_0$; accordingly, free energy F
can be rewritten by using cross entropy U as

\[ F = U - \frac{1}{\beta} H. \tag{22} \]

It follows that $\{U, \beta\}$ can be used instead of $\{U_0, \beta_0\}$. The
Lagrangian L is expressed as

\[ L = F + \lambda \Bigl( \sum_{x} P(x) - 1 \Bigr) = U - \frac{1}{\beta} H + \lambda \Bigl( \sum_{x} P(x) - 1 \Bigr), \tag{23} \]

where $\lambda$ is the Lagrange multiplier. In relation to that expression, if
$\beta_0 \to 0$, then $\beta \to 0$ (high-temperature limit); if
$\beta_0 \to \infty$, then $\beta \to 1$ (low-temperature limit). The solution
P(X) is derived from the equation $\partial L / \partial P(x) = 0$, that is,
$-\log \hat{P}(x) + \frac{1}{\beta} (\log P(x) + 1) + \lambda = 0$, which gives
$P(x) \propto [\hat{P}(x)]^{\beta}$. The estimated probability P(X) is
therefore expressed in the form of the canonical distribution, which is also
called the Gibbs distribution and is well known in statistical physics:

\[ P(x) = \frac{\exp(-\beta(-\log \hat{P}(x)))}{\sum_{x'} \exp(-\beta(-\log \hat{P}(x')))} \tag{24} \]

\[ P(x) = \frac{\exp(-\beta \epsilon(x))}{\sum_{x'} \exp(-\beta \epsilon(x'))}, \tag{25} \]

where $\epsilon(x)$ as expressed in Equation (12) is used. Practically, the
following equivalent form is used:

\[ P(x) = \frac{[\hat{P}(x)]^{\beta}}{\sum_{x'} [\hat{P}(x')]^{\beta}}, \tag{26} \]

where $\beta$ can be determined, without any free parameters, by using
Equations (13), (14), (15), and (24). For data size n = 0, the estimator is
defined such that $P(x) = 1/|X|$, where $|X|$ denotes the number of elements in
the range of X. Note that the proposed method is consistent with the ME
principle at the high-temperature limit, where $\min F \approx \max H$, and
with the ML principle at the low-temperature limit, where
$\min F \approx \min U$.
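The tempered form P(x) proportional to the ML estimate raised to the power beta can be sketched and checked against the free energy F = U - H / beta. In this minimal sketch the function names, the ML estimate `p_ml`, and the value of `beta` are illustrative assumptions, and all ML probabilities are assumed to be positive; the random comparison is only a sanity check, not a proof of optimality.

```python
import math
import random

def canonical_estimator(p_ml, beta):
    """Return P(x) proportional to p_ml(x) ** beta (all p_ml(x) > 0 assumed)."""
    weights = [p ** beta for p in p_ml]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

def free_energy(p, p_ml, beta):
    """F = U - H / beta, with U the cross entropy H(p, p_ml)."""
    u = -sum(pi * math.log(qi) for pi, qi in zip(p, p_ml) if pi > 0)
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return u - h / beta

p_ml = [0.5, 0.3, 0.2]  # hypothetical ML estimate
beta = 0.6              # hypothetical data temperature
p_star = canonical_estimator(p_ml, beta)

# Compare against randomly drawn distributions on the same states.
rng = random.Random(0)
best_random = math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in p_ml]
    total = sum(raw)
    q = [r / total for r in raw]
    best_random = min(best_random, free_energy(q, p_ml, beta))

print(free_energy(p_star, p_ml, beta) <= best_random)
```

With beta = 1 the function returns the ML estimate itself, and with beta = 0 it returns the uniform distribution, matching the low- and high-temperature limits described above.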

For conditional probability, conditional entropy $H(X \mid Y)$ and conditional
KL divergence $D(P(X \mid Y) \,\|\, \hat{P}(X \mid Y))$ or conditional cross
entropy are used. $\beta$ is defined as
$\beta(X \mid Y) := \beta_0(X \mid Y) / (\beta_0(X \mid Y) + 1)$. The formula
for estimating conditional probabilities is therefore obtained in the following
form:

\[ P(x \mid y) = \frac{\exp(-\beta(-\log \hat{P}(x \mid y)))}{\sum_{x'} \exp(-\beta(-\log \hat{P}(x' \mid y)))}. \tag{27} \]

For conditional data size n = 0 given Y = y, $P(x \mid y) = 1/|X|$ for any y,
in the same manner as P(x) given in Section 3.1.

Joint probability can be calculated by using Equations (24) and (27) and
the relation $P(X, Y) = P(X \mid Y) P(Y)$. In general, it is calculated
using the decomposition rule

\[ P(X_1, X_2, \ldots, X_n) = P(X_n \mid X_{n-1}, \ldots, X_1) \cdots P(X_2 \mid X_1) P(X_1). \]
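The conditional estimate and the product rule P(x, y) = P(x | y) P(y) can be sketched as below. This is a minimal illustration under stated assumptions: the count table, the per-row temperatures in `betas`, and the marginal estimate `p_y` are hypothetical inputs supplied by hand rather than computed by the procedure of this paper.

```python
def canonical_estimator(p_ml, beta):
    """Return P(x) proportional to p_ml(x) ** beta."""
    weights = [p ** beta for p in p_ml]
    z = sum(weights)
    return [w / z for w in weights]

def conditional_estimates(counts, betas):
    """counts[y] lists the counts of each x given Y = y; betas[y] is that row's temperature."""
    result = {}
    for y, row in counts.items():
        n_y = sum(row)
        if n_y == 0:
            # No data for this condition: uniform distribution over the states of X.
            result[y] = [1.0 / len(row)] * len(row)
        else:
            result[y] = canonical_estimator([c / n_y for c in row], betas[y])
    return result

def joint_estimate(p_y, p_x_given_y):
    """P(x, y) = P(x | y) P(y), keyed by (x index, y label)."""
    return {(x, y): p_y[y] * px
            for y, row in p_x_given_y.items()
            for x, px in enumerate(row)}

counts = {"y0": [3, 1], "y1": [0, 0]}  # hypothetical counts of X given Y
betas = {"y0": 0.8, "y1": 0.0}         # hypothetical per-row temperatures
p_y = {"y0": 0.7, "y1": 0.3}           # hypothetical estimate of P(y)

p_x_given_y = conditional_estimates(counts, betas)
joint = joint_estimate(p_y, p_x_given_y)
print(p_x_given_y, sum(joint.values()))
```

The row with data is pulled from its relative frequencies (0.75, 0.25) toward the uniform distribution, while the row without data is exactly uniform.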

Partition functions similar to those of statistical mechanics are introduced
for convenience. By using "data temperature" $\beta$, free energy F is
expressed in the same form as that in statistical mechanics:

\[ F = U - \frac{1}{\beta} H = -\frac{1}{\beta} \log Z, \tag{28} \]

where Z is the partition function, which is well known in statistical
mechanics, defined for single or multivariate probabilities as

\[ Z(X) = \sum_{x} [\hat{P}(x)]^{\beta}, \tag{29} \]

and for conditional probabilities as

\[ Z(X \mid Y) = \sum_{x} [\hat{P}(x \mid Y)]^{\beta}. \tag{30} \]

Consequently, when the thermodynamical framework and Shannon entropy are
assumed, the partition-function formula of statistics, which is common to
statistical mechanics, can be derived.
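The identity F = U - H / beta = -(1 / beta) log Z for the tempered estimator can be confirmed numerically. The values of `p_ml` and `beta` below are illustrative assumptions.

```python
import math

p_ml = [0.5, 0.3, 0.2]  # hypothetical ML estimate
beta = 0.7              # hypothetical data temperature

z = sum(q ** beta for q in p_ml)        # partition function
p = [q ** beta / z for q in p_ml]       # canonical (tempered) estimator

u = -sum(pi * math.log(qi) for pi, qi in zip(p, p_ml))  # cross entropy U
h = -sum(pi * math.log(pi) for pi in p)                 # Shannon entropy H
free_energy = u - h / beta

print(free_energy, -math.log(z) / beta)
```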

A significant feature of data temperature is proved next. The proof requires
the following lemma.

Lemma 3.1 Let $P_i(X)$ denote the canonical distribution estimated by Equation
(24) from i data. For data size $n \to \infty$, $P^G_n(x)$, defined by Equation
(13), converges to a definite value $P^G(x)$ when $0 < P_i(x)$ for all integers
$i \geq 0$ and any state x.

Proof.

\[ \begin{aligned} \log P^G_n(x) - \log P^G_{n-1}(x) &= \frac{1}{n+1} \sum_{i=0}^{n} \log P_i(x) - \frac{1}{n} \sum_{i=0}^{n-1} \log P_i(x) \\ &= \Bigl( \frac{n}{n+1} - 1 \Bigr) \log P^G_{n-1}(x) + \frac{1}{n+1} \log P_n(x). \end{aligned} \tag{31} \]

Because $0 < P^G_{n-1}(x)$ and $0 < P_n(x)$, $\log P^G_{n-1}(x)$ and
$\log P_n(x)$ are definite values; if $n \to \infty$, both terms on the
right-hand side of Equation (31) converge to 0. Therefore,
$P^G_n(x) \to P^G(x)$ because log functions are single-valued functions.
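The recurrence used in the proof, which expresses the change in the log of the running geometric mean in terms of the previous mean and the newest estimate, can be checked on a short sequence. The sequence `p_seq` is a made-up stand-in for the values P_0(x), P_1(x), ... at one fixed state x.

```python
import math

def log_geometric_mean(values):
    """Log of the geometric mean, i.e. the arithmetic mean of the logs."""
    return sum(math.log(v) for v in values) / len(values)

# Hypothetical positive sequence P_0(x), P_1(x), ... for one fixed state x.
p_seq = [0.5, 0.4, 0.45, 0.42, 0.43]

gaps = []
for n in range(1, len(p_seq)):
    prev = log_geometric_mean(p_seq[:n])  # log P^G_{n-1}(x)
    lhs = log_geometric_mean(p_seq[: n + 1]) - prev
    rhs = (n / (n + 1) - 1) * prev + math.log(p_seq[n]) / (n + 1)
    gaps.append((lhs, rhs))

print(all(abs(l - r) < 1e-12 for l, r in gaps))
```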

Theorem 3.1 At the asymptotic limit (i.e., the large sample limit), data
temperature $\beta$ converges to 1 when $0 < P_i(x)$ for all integers
$i \geq 0$ and any state x, where $P_i(x)$ denotes the canonical distribution
estimated by Equation (24) from i data.

Proof. According to Lemma 3.1, $P^G_n(x) \to P^G(x)$ at the limit
$n \to \infty$, where $P^G(x)$ is a definite value for any x. $P^G(x)$ also
coincides with the limiting value of $P_n(x)$, where $P_n(x)$ is the estimated
value given by Equation (24) using n data, because $\log P^G_n(x)$ is a mean
value of $\log P_i(x)$. $\beta$ thus converges to a definite value.
Meanwhile, the ML estimator $\hat{P}(x)$ converges to the true distribution
$P_t(x)$ due to the consistency of ML estimators. $P_n(x)$ at $n \to \infty$ is
denoted as P(x). Therefore, at $n \to \infty$,

\[ \frac{1}{\beta_0} = \frac{1 - \beta}{\beta} = D(P(X) \,\|\, P_t(X)) = \sum_{x} \frac{[P_t(x)]^{\beta}}{Z} \log \frac{[P_t(x)]^{\beta - 1}}{Z} \tag{32} \]

for $0 < P(x)$ and any state x. This identity (32) needs $\beta \to 0$ or
$\beta \to 1$ in order that $[P_t(x)]^{\beta}$ or $[P_t(x)]^{\beta - 1}$ is a
constant for any probability distribution $P_t(x)$. However, $\beta \to 0$ does
not satisfy Equation (32), while $\beta \to 1$ does satisfy the equation.
Accordingly, $\beta \to 1$ is the asymptotic limit.

According to Theorem 3.1, the more data are obtained, the closer $\beta$
approaches 1 and the closer the estimator approaches the ML estimator. It is
therefore notable that the new estimator has the same desirable asymptotic
properties as the ML estimator, namely consistency and asymptotic efficiency.

For insufficient data size, $\beta_0$ is small by definition, so $\beta$ is
also small. Adequate estimators, which are automatically adjusted to the
available data, can therefore be obtained. In other words, free energy is
dominated by the second term of Equation (22) when sufficient data is not
available, because uncertainty is large due to the shortage of evidence. In
contrast, it is dominated by the first term when sufficient data is available,
because uncertainty is small due to the large size of the data. We call the
proposed method MFEE, an abbreviation of "MFE estimation".

3.2 Interpretation of probability and estimated information

MFEE provides a new interpretation of probability that differs from
frequentism and Bayesianism. Frequentism is based on counting occurrences of
events (e.g., Hájek, 1997), and Bayesianism is based on subjectivity or on a
combination of counting prior imaginary and real occurrences of events. The
Bayesian approach can be regarded as extending frequentism to include (prior)
imaginary counting of events. The thermodynamical estimation method stated in
this section is based on the concept of optimal uncertainty, which consists of
counting events and temperature. In MFEE, probability can be regarded as the
degree of uncertainty obtained when the MFE principle optimizes uncertainty in
a way that reflects the quality and quantity of the data. It can therefore be
regarded as a new interpretation of probability in real-world applications.

When the optimal entropy is obtained by using the MFE principle, the optimal
negative entropy represents the optimally estimated effective information,
which is denoted as EI (for "effective information"), for given data as
follows:

\[ \mathrm{EI}(X) := -H(X) = \sum_{x} P(x) \log P(x). \]

The averaged log likelihood is therefore a large sample approximation of EI.

3.3 Some characteristic properties of MFEE

The canonical distribution derived from the MFE principle provides some
characteristic properties of MFEE. The following notations are used.
Probability mass functions, such as $P_k$, have discrete states that are
denoted by index k. The ML estimator is denoted by $\hat{P}_k$.
$\{\beta, U\}$ is used instead of $\{\beta_0, U_0\}$.



3.3.1 Relation to usual definition of temperature 



In thermodynamics, temperature $\beta$ is usually defined as (Callen, 1985;
Kittel and Kroemer, 1980)

\[ \beta := \frac{\partial H}{\partial U}. \tag{33} \]

However, if the canonical distribution, which has the form of Equation (24), is
used, the usual definition of $\beta$ is automatically satisfied as follows.

Lemma 3.2 $H = \beta U + \log Z$ under the MFE condition, where H, U, $\beta$,
and Z denote entropy, information energy, data temperature, and partition
function, respectively.

Proof. Since probability mass function $P_k$ has a canonical form under the
MFE condition, it follows that

\[ \begin{aligned} H = -\sum_{k} P_k \log P_k &= -\sum_{k} \frac{\hat{P}_k^{\beta}}{Z} \log \frac{\hat{P}_k^{\beta}}{Z} \\ &= \beta \Bigl( -\sum_{k} \frac{\hat{P}_k^{\beta}}{Z} \log \hat{P}_k \Bigr) + \log Z \\ &= \beta U + \log Z, \end{aligned} \tag{34} \]

where cross entropy is used as U.


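Theorem 3.2 Equation (33) is automatically satisfied under the MFE principle.

Proof. Differentiating both sides of Equation (34) partially with respect to U
gives

\[ \begin{aligned} \frac{\partial H}{\partial U} &= \beta + U \frac{\partial \beta}{\partial U} + \frac{\partial \beta}{\partial U} \frac{\partial}{\partial \beta} \log \sum_{k} \hat{P}_k^{\beta} \\ &= \beta + U \frac{\partial \beta}{\partial U} + \frac{\partial \beta}{\partial U} \sum_{k} \frac{\hat{P}_k^{\beta}}{Z} \log \hat{P}_k. \end{aligned} \tag{35} \]

Because the last sum equals $-U$, it follows that

\[ \frac{\partial H}{\partial U} = \beta. \tag{36} \]

The theorem is therefore proved.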
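The relation beta = dH/dU can be examined numerically by varying beta along the canonical family with the ML estimate held fixed and taking the ratio of the changes in H and U. Treating the derivative this way is an assumption of this sketch, as are the values of `P_ML`, `beta`, and `step`.

```python
import math

P_ML = [0.5, 0.3, 0.2]  # hypothetical ML estimate, held fixed

def entropy_and_energy(beta):
    """Return (H, U) of the canonical distribution at the given beta."""
    z = sum(q ** beta for q in P_ML)
    probs = [q ** beta / z for q in P_ML]
    h = -sum(p * math.log(p) for p in probs)
    u = -sum(p * math.log(q) for p, q in zip(probs, P_ML))
    return h, u

beta, step = 0.6, 1e-5
h_hi, u_hi = entropy_and_energy(beta + step)
h_lo, u_lo = entropy_and_energy(beta - step)

# Ratio of central differences approximates dH/dU along the canonical family.
slope = (h_hi - h_lo) / (u_hi - u_lo)
print(slope, beta)
```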

In the same way, it can be proved that /?o = dH/dUa. 
3.3.2 Energy fluctuation

In statistical mechanics, energy fluctuations
$\langle \epsilon^2 \rangle - \langle \epsilon \rangle^2$ are shown to have the
following relation, where $\langle \cdot \rangle$ denotes an expectation value
with respect to the canonical distribution.

$$\langle \epsilon^2 \rangle - \langle \epsilon \rangle^2 = -\frac{\partial U}{\partial \beta}. \tag{37}$$

In regard to the energy defined in MFEE, namely, Equation (12), the same
relation as that shown here is satisfied. We use the following equation:

$$U = -\sum_k \frac{\hat{P}_k^{\beta}}{Z} \log \hat{P}_k, \tag{38}$$

where $U$ denotes cross entropy.



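$$
\begin{aligned}
-\frac{\partial U}{\partial \beta}
  &= \sum_k \log \hat{P}_k \left( \frac{\hat{P}_k^{\beta}}{Z} \log \hat{P}_k
     - \frac{\hat{P}_k^{\beta}}{Z^2} \frac{\partial Z}{\partial \beta} \right) \\
  &= \sum_k \log \hat{P}_k \left\{ \frac{\hat{P}_k^{\beta}}{Z} \log \hat{P}_k
     - \frac{\hat{P}_k^{\beta}}{Z} \sum_m \frac{\hat{P}_m^{\beta}}{Z} \log \hat{P}_m \right\} \\
  &= \sum_k \frac{\hat{P}_k^{\beta}}{Z} \log \hat{P}_k \log \hat{P}_k
     - \left( \sum_m \frac{\hat{P}_m^{\beta}}{Z} \log \hat{P}_m \right)^2 \\
  &= \langle \epsilon^2 \rangle - \langle \epsilon \rangle^2,
\end{aligned} \tag{39}
$$

where we use Equation (12). Equation (37) is therefore proved.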
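The fluctuation relation can also be checked numerically. The sketch below assumes energies $\epsilon_k = -\log \hat{P}_k$ averaged under $P_k = \hat{P}_k^{\beta} / Z$, and compares the variance of the energy with a central-difference estimate of $-\partial U / \partial \beta$. The input values and function names are hypothetical.

```python
import math

def energy_moments(p_hat, beta):
    # First and second moments of eps_k = -log p_hat_k under P_k = p_hat_k**beta / Z
    weights = [p ** beta for p in p_hat]
    z = sum(weights)
    eps = [-math.log(p) for p in p_hat]
    mean = sum(w / z * e for w, e in zip(weights, eps))
    mean_sq = sum(w / z * e * e for w, e in zip(weights, eps))
    return mean, mean_sq

p_hat = [0.5, 0.3, 0.2]  # hypothetical ML estimate
beta, step = 0.8, 1e-5   # hypothetical beta; finite-difference step

u, u_sq = energy_moments(p_hat, beta)
variance = u_sq - u * u                        # <eps^2> - <eps>^2

u_hi, _ = energy_moments(p_hat, beta + step)
u_lo, _ = energy_moments(p_hat, beta - step)
minus_du_dbeta = -(u_hi - u_lo) / (2 * step)   # central difference for -dU/dbeta

print(variance, minus_du_dbeta)
```

The two printed numbers should agree up to the finite-difference error.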



3.3.3 Pseudo Fisher information and energy fluctuation 

When $\beta$ is a parameter of probability mass function $P$, Fisher information
$I(\beta)$ is defined in the usual way as

$$I(\beta) := \sum_k f_k(\beta) \left( \frac{\partial}{\partial \beta} \log f_k(\beta) \right)^2, \tag{40}$$

where $f$ is the likelihood function. Here, the likelihood function is replaced
with the probability function, and we call the resulting $I(\beta)$ the
MFE-Fisher information.

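When it is assumed that the probability functions can be expressed by the
canonical distributions with parameter $\beta$, it is clear that
$I(\beta) = \langle \epsilon^2 \rangle - \langle \epsilon \rangle^2$ as follows.

$$
\begin{aligned}
I(\beta)
  &= \sum_k \frac{\hat{P}_k^{\beta}}{Z} \left( \frac{\partial}{\partial \beta} \log \frac{\hat{P}_k^{\beta}}{Z} \right)^2 \\
  &= \sum_k \frac{\hat{P}_k^{\beta}}{Z} \left( \log \hat{P}_k - \sum_m \frac{\hat{P}_m^{\beta}}{Z} \log \hat{P}_m \right)^2 \\
  &= \sum_k \frac{\hat{P}_k^{\beta}}{Z} \left( \log \hat{P}_k \right)^2
     - \left( \sum_k \frac{\hat{P}_k^{\beta}}{Z} \log \hat{P}_k \right)^2 \\
  &= \langle \epsilon^2 \rangle - \langle \epsilon \rangle^2.
\end{aligned} \tag{41}
$$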

The MFE-Fisher information is therefore identical to the energy fluctuation
defined in Equation (37).
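As a rough numerical illustration of this identity, the sketch below evaluates the score $\partial \log P_k / \partial \beta$ by central differences, forms $\sum_k P_k (\partial \log P_k / \partial \beta)^2$, and compares it with the variance of $\epsilon_k = -\log \hat{P}_k$. It assumes the canonical form $P_k = \hat{P}_k^{\beta} / Z$; the inputs and names are hypothetical.

```python
import math

def log_canonical(p_hat, beta):
    # log P_k = beta * log p_hat_k - log Z
    log_z = math.log(sum(p ** beta for p in p_hat))
    return [beta * math.log(p) - log_z for p in p_hat]

p_hat = [0.5, 0.3, 0.2]  # hypothetical ML estimate
beta, step = 0.8, 1e-5   # hypothetical beta; finite-difference step

log_p = log_canonical(p_hat, beta)
log_p_hi = log_canonical(p_hat, beta + step)
log_p_lo = log_canonical(p_hat, beta - step)
probs = [math.exp(v) for v in log_p]

# Score d log P_k / d beta by central differences, then its second moment
score = [(hi - lo) / (2 * step) for hi, lo in zip(log_p_hi, log_p_lo)]
fisher = sum(q * s * s for q, s in zip(probs, score))

# Energy variance under the same canonical distribution
eps = [-math.log(p) for p in p_hat]
mean = sum(q * e for q, e in zip(probs, eps))
variance = sum(q * (e - mean) ** 2 for q, e in zip(probs, eps))

print(fisher, variance)
```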

3.3.4 Other similarities with statistical mechanics 

It is noteworthy that MFEE has other similarities with thermodynamics and/or 
statistical mechanics. That is, the same relationships exist. We list those below: 

• The following relation is easily derived from the definition of partition 
function Z: 

$$U = -\frac{\partial}{\partial \beta} \log Z. \tag{42}$$

• The following relation, known as the Gibbs-Helmholtz relation, is derived
from Equations (28) and (42) as follows:

$$U = \frac{\partial (\beta F)}{\partial \beta}. \tag{43}$$

• The following relation is simply obtained from Equations (22) and (43):

(44) 

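• The energy variance is represented by the second-order derivative of the
logarithm of the partition function with respect to $\beta$ as

$$\langle \epsilon^2 \rangle - \langle \epsilon \rangle^2 = \frac{\partial^2}{\partial \beta^2} \log Z. \tag{45}$$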

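• Thermal capacity is represented by data temperature and energy fluctuation
as

$$\beta^2 \frac{\partial^2}{\partial \beta^2} \log Z = \beta^2 \left( \langle \epsilon^2 \rangle - \langle \epsilon \rangle^2 \right). \tag{46}$$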
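The relations between $\log Z$, the energy, and the energy variance listed above can be illustrated with finite differences. This is a minimal sketch, assuming $Z = \sum_k \hat{P}_k^{\beta}$ and $\epsilon_k = -\log \hat{P}_k$; the inputs and function names are hypothetical.

```python
import math

def log_partition(p_hat, beta):
    # log Z with Z = sum_k p_hat_k**beta
    return math.log(sum(p ** beta for p in p_hat))

def energy_stats(p_hat, beta):
    # Mean and variance of eps_k = -log p_hat_k under P_k = p_hat_k**beta / Z
    z = sum(p ** beta for p in p_hat)
    probs = [p ** beta / z for p in p_hat]
    eps = [-math.log(p) for p in p_hat]
    mean = sum(q * e for q, e in zip(probs, eps))
    var = sum(q * (e - mean) ** 2 for q, e in zip(probs, eps))
    return mean, var

p_hat = [0.5, 0.3, 0.2]  # hypothetical ML estimate
beta, step = 0.8, 1e-4   # hypothetical beta; finite-difference step

lz_lo = log_partition(p_hat, beta - step)
lz_mid = log_partition(p_hat, beta)
lz_hi = log_partition(p_hat, beta + step)

first = (lz_hi - lz_lo) / (2 * step)                # d log Z / d beta
second = (lz_hi - 2 * lz_mid + lz_lo) / step ** 2   # d^2 log Z / d beta^2
mean, var = energy_stats(p_hat, beta)

print(-first, mean)   # both approximate U, as in Equation (42)
print(second, var)    # both approximate the energy variance, as in Equation (45)
```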
3.4 Incorporating prior knowledge

MFEE can incorporate Bayesian subjective beliefs. It can be said that Bayesian
subjective probabilities extend counts of events to the sum of those counts and
prior imaginary counts of the events. Our method can therefore easily include
subjectivity. To achieve this extension, the ML estimators are replaced with
Bayesian posterior point probabilities, such as those estimated by the maximum
a posteriori (MAP) method, which leads to the following Bayesian canonical
distributions instead of the formula expressed by Equation (24):

$$P^{\beta}(x) = \frac{\exp\left(-\beta\left(-\log P_{Bayes}(x)\right)\right)}{\sum_{x'} \exp\left(-\beta\left(-\log P_{Bayes}(x')\right)\right)}, \tag{47}$$

where $P_{Bayes}(x)$ is the Bayesian posterior point probability, and the data
temperature is calculated from Equation (14), where the ML estimator is replaced
by $P_{Bayes}(x)$. In this case, the stronger the subjectivity, the higher the
data temperature; that is, the tempering effect caused by $\beta$ is weaker.
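The Bayesian canonical distribution can be sketched in a few lines. The example below is illustrative only: `point_estimate` is a hypothetical stand-in that simply normalizes the sum of observed counts and imaginary prior counts, and `beta` is set by hand rather than computed from the data temperature equation.

```python
import math

def point_estimate(counts, pseudo_counts):
    # Hypothetical stand-in for a Bayesian point estimate:
    # normalized sum of observed counts and imaginary prior counts
    totals = [c + a for c, a in zip(counts, pseudo_counts)]
    n = sum(totals)
    return [t / n for t in totals]

def bayes_canonical(p_bayes, beta):
    # P(x) proportional to exp(-beta * (-log P_Bayes(x)))
    weights = [math.exp(-beta * (-math.log(p))) for p in p_bayes]
    z = sum(weights)
    return [w / z for w in weights]

counts = [3, 1, 0]         # hypothetical observed counts
pseudo = [1.0, 1.0, 1.0]   # hypothetical imaginary prior counts
p_bayes = point_estimate(counts, pseudo)
tempered = bayes_canonical(p_bayes, beta=0.6)  # beta chosen by hand here

print(p_bayes)
print(tempered)
```

With `beta=1.0` the function returns `p_bayes` unchanged, and values of `beta` between 0 and 1 flatten it.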

4 Examples 

Simulations to demonstrate the robustness for small samples of MFEE, in com- 
parison with ML, ME, and Bayesian-Dirichlet estimators with Jeffreys' prior, 
are described. 

X is assumed to have three internal states and four probability mass func- 
tions with a variety of entropies denoted as H(X) in natural logarithms: 



1. P(x = 0) = 0.431, P(x = 1) = 0.337, P(x = 2) = 0.232, H(X) = 1.07,

2. P(x = 0) = 0.677, P(x = 1) = 0.206, P(x = 2) = 0.117, H(X) = 0.841,

3. P(x = 0) = 0.851, P(x = 1) = 0.117, P(x = 2) = 0.0320, H(X) = 0.498,

4. P(x = 0) = 0.9898, P(x = 1) = 0.00810, P(x = 2) = 0.00210, H(X) = 0.0621.
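The entropies quoted for the four distributions can be recomputed directly. The short sketch below does so in natural logarithms; the dictionary and function names are chosen here for illustration.

```python
import math

# The four true distributions and the entropies quoted in the text
distributions = {
    1: (0.431, 0.337, 0.232),
    2: (0.677, 0.206, 0.117),
    3: (0.851, 0.117, 0.0320),
    4: (0.9898, 0.00810, 0.00210),
}
quoted_entropy = {1: 1.07, 2: 0.841, 3: 0.498, 4: 0.0621}

def entropy_nats(p):
    # Shannon entropy with natural logarithms
    return -sum(q * math.log(q) for q in p if q > 0)

computed_entropy = {k: entropy_nats(p) for k, p in distributions.items()}
for k in sorted(distributions):
    print(k, round(computed_entropy[k], 4), quoted_entropy[k])
```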











Data were sampled from each function, and probabilities were estimated from the
given data sets with various data sizes. In the estimation, ML, ME, and
Bayesian-Dirichlet estimation with hyperparameter $\alpha = 1/2$ derived from
Jeffreys' prior distribution on Dirichlet models (Robert, 2007, p. 130) were
used. The maximum a posteriori (MAP) method was used for Bayesian-Dirichlet
estimation from the viewpoint of point estimation. We set, as usual, the
averaged output $\langle X \rangle$ as the constraint in the ME method as
follows: $\langle X \rangle := (1/N) \sum_{d=1}^{N} X_d$, where $X_d$ denotes the
$d$-th sample's output and $N$ denotes the sample size. After that, true and
estimated probabilities were compared by using the Kullback-Leibler (KL)
divergence (Kullback and Leibler, 1951) as a metric, which has the following
form:

$$D(P(X) \,\|\, P_e(X)) = \sum_x P(x) \log \frac{P(x)}{P_e(x)}, \tag{48}$$
where $P(X)$ is the true distribution, and $P_e(X)$ is the distribution estimated
by ML, ME, MAP, or MFEE. To avoid zero probabilities, probabilities were
smoothed by adding 0.0001 to the counts.
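The evaluation loop for one estimator can be sketched as follows. This is a minimal illustration for the ML estimator only, assuming that the smoothing constant is added to every count before normalization; the seed, the sample sizes, and the function names are hypothetical and do not reproduce the reported experiments.

```python
import math
import random

def ml_estimate(counts, smoothing=0.0001):
    # Relative frequencies after adding a small constant to each count
    adjusted = [c + smoothing for c in counts]
    total = sum(adjusted)
    return [a / total for a in adjusted]

def kl_divergence(p_true, p_est):
    # D(P || P_e) = sum_x P(x) log(P(x) / P_e(x)), in nats
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_est) if p > 0)

def sample_counts(p_true, n, rng):
    # Draw n samples from p_true and count the occurrences of each state
    counts = [0] * len(p_true)
    for x in rng.choices(range(len(p_true)), weights=p_true, k=n):
        counts[x] += 1
    return counts

rng = random.Random(0)           # fixed seed, chosen arbitrarily
p_true = [0.677, 0.206, 0.117]   # distribution 2 from the list above
for n in (5, 50, 500):           # illustrative sample sizes
    runs = [kl_divergence(p_true, ml_estimate(sample_counts(p_true, n, rng)))
            for _ in range(100)]
    print(n, sum(runs) / len(runs))
```

Averaging over 100 data sets per sample size mirrors the averaging described for the figure; the other estimators would be plugged in where `ml_estimate` is called.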

The KL divergences are shown in Fig. 1, where they are averaged values
from 100 samples at each sample size from identical distributions. It can be
seen that the ML estimators are inferior to MFEE due to overfitting, except for
the distribution having very small entropy. Even the degree of superiority of
the ML estimation in (d) is relatively smaller than that of its inferiority in
the other distributions. The ME estimators showed the opposite behavior to the
ML estimators and showed some relatively poor results in large-sample regions.
ML methods tend to fit the data and can therefore estimate distributions with
very low entropies more accurately than the other methods in small-sample
cases. For example, if the true entropy equals zero, the ML method can estimate
the exact true distribution from only one sample. On the other hand, ME methods
tend to increase entropies and can therefore estimate distributions with high
entropies, such as uniform distributions, more accurately. Hence, in terms of
misestimation, ML has a tendency toward overfitting and ME has a tendency toward
underfitting. Even so, MFEE showed relative stability across both sample sizes
and distributions. This indicates the effectiveness of the MFEE method, which
was defined so as to incorporate characteristics of both ME and ML. Not all
values could be estimated by the MAP at each sample size for the following
sample sizes (N): N < 19 in (a), N < 50 in (b), N < 103 in (c), and N > 500 in
(d), where the size decreased in response to the values of entropies for each
distribution. This is because even the posterior probability distributions were
improper distributions, which has been pointed out as a problem of Bayesian
statistics (Kass and Wasserman, 1996). In these ranges, the averaged points
for the MAP were not plotted in Fig. 1. Our method always provides values for
any sample size and was effective in avoiding overfitting (at least to some
extent).



5 Discussion 

5.1 Relation to classical and Jaynes' approach 

The classical approach to probability estimation based on frequentism is
included in our method. This approach can be considered as a method where
$\beta$ is assumed to be 1 or nearly 1 in our MFE-based method. That is, the
approach can be said to be a zero-temperature or a low-temperature approximation
of our method.
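The role of $\beta$ here can be seen in a small sketch, assuming the canonical form $P_k = \hat{P}_k^{\beta} / Z$: setting $\beta = 1$ returns the ML estimate unchanged, while smaller values flatten it, and $\beta = 0$ gives the uniform distribution under this form. The input values and the function name are hypothetical.

```python
def temper(p_hat, beta):
    # Canonical form: P_k = p_hat_k**beta / sum_m p_hat_m**beta
    weights = [p ** beta for p in p_hat]
    z = sum(weights)
    return [w / z for w in weights]

p_hat = [0.8, 0.15, 0.05]  # hypothetical ML estimate from a small sample
for beta in (1.0, 0.9, 0.5, 0.0):
    print(beta, [round(q, 4) for q in temper(p_hat, beta)])
```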

MFEE suggests a new interpretation of probability, in which sample size is 
not explicitly included, unlike both the classical and Bayesian approaches. Al- 
though an equivalent sample size can be calculated from a probability estimated 
by MFEE, the calculated value is no more than the one as interpreted in the 
language of frequentism. 

Jaynes' maximum entropy (ME) methods are well known as the least biased
inference methods. However, the constraints on which ME methods are based may
not be reliable for small samples and may then be biased. On the other hand,
MFEE corrects even such biases using temperature. Moreover, ME methods seemed to
fail in estimation from large samples in our simulations, which implies that ME
methods do not fully take advantage of information from available data.



5.2 Relation to Bayesian approach 

Our approach is quite different from Bayesian approaches when prior knowledge 
is not available or should be excluded. MFEE assumes that a physics-like mecha- 
nism determines optimal estimation, while the Bayesian approach assumes non- 
informative prior distributions unrelated to the optimal estimation. In addition, 
the former puts optimal entropy at the center of the method, which seems de- 
sirable because statistical inference aims to get optimal useful information from 
data. 

In hierarchical Bayesian models, the hyperparameters of prior distributions
are often determined by maximizing marginal distributions; such approaches are
called empirical Bayesian methods. Our method is not suitable for these models
because the hyperparameters are not those of noninformative priors and can be
interpreted as additive parameters that complement the incompleteness of the
structures of the models. A similar situation occurs in Bayesian network
classifiers (BNCs) (Friedman et al., 1997), as mentioned in our previous work
(Isozaki et al., 2008, 2009). BNCs are being developed in the machine learning
domain for classification tasks and are generalized from well-known naive Bayes
classifiers. In the case of BNCs, it is known that their conditional
probabilities play a part in complementing inaccuracies of estimated network
structures (Jing et al., 2005).



6 Summary 

Based neither on frequentism nor Bayesianism, a robust method of probability
estimation for discrete random variable systems, based on both thermodynamics
and information theory, was developed. The core of the method is the intent to
obtain optimized entropy explicitly, namely, obtaining optimized information
from available limited data. The theory introduces two new quantities:
information energy and data temperature. Free energy is defined by using these
quantities. The minimum free energy principle for inference, which unifies the
maximum likelihood and maximum entropy principles with the above quantities, is
adopted. The theory has advantages over frequentism because it is more robust
for small sample sizes, and over Bayesianism because it does not use
prior/posterior distributions when no prior knowledge is available, where
prior biases are regarded as not completely excluded. The effectiveness of the
method in terms of robustness was demonstrated by simulation studies on point
estimation for single variable systems with various entropies.
Figure 1: KL divergences between true probability mass functions and
probability mass functions estimated by using maximum likelihood (ML), MFEE,
maximum a posteriori with Jeffreys' prior (Bayes), and maximum entropy (ME).
The horizontal axes denote sample sizes. H denotes Shannon entropy in natural
logarithms.